In [171]:
#!pip install pandas-profiling
In [172]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
import pandas_profiling
In [173]:
for f in os.listdir():
    print(f.ljust(30) +"--" + str(round(os.path.getsize(f) / 1000000, 2)) + 'MB')
.ipynb_checkpoints            --0.0MB
682483_Connected Cars - Assessment_16_26_52(GMT +0530).pdf--8.71MB
Bank_Personal_Loan_Modelling.csv--0.21MB
InstructionsToOnlineClasses.docx--0.04MB
Project_BankPersonalLoan_Manoj.ipynb--9.3MB
Project_Bank_PersonalLoan.html--9.41MB
Project_Bank_PersonalLoan.ipynb--9.08MB
Supervised Learning Problem Statement-1.pdf--0.45MB
In [174]:
df=pd.read_csv('Bank_Personal_Loan_Modelling.csv')
df.shape
Out[174]:
(5000, 14)
  • Data set has 5000 rows with 14 columns
In [175]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
  • All Data types are int/float - 13 columns
  • Categorical col: ID, ZIP Code, Family, Education, Securities Account, CD Account, Online,CreditCard
  • Numerical Coulmns: Age, Experience, Income, CCAvg, Mortgage
  • Dependent Variable/Target Column: Personal Loan
In [176]:
df.head()
Out[176]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [177]:
df.describe(include='all')
Out[177]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.104600 73.774200 93152.503000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 1443.520003 11.463166 11.467954 46.033729 2121.852197 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 1.000000 23.000000 -3.000000 8.000000 9307.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000
  • ID: all unique values in incremental steps. Customer ID's.
  • Age: range from 23-69, mean and avg were close around 45. 25%,50% and 75% shows steady increase in value suggests flat like distribution.
  • Experience: Min value appeared as negative which needs deeper look and cleaning as it should not be less than 0.
  • Income: range from 8-224K, Outliers towards higher income range. Mean is higher than 50 percentile suggests skewness
  • Zipcode/Family/Securities Account/CD Account/Online/CreditCard: categorical values. These values looks good and no cleaning is required as per available details.
In [178]:
df.profile_report()



Out[178]:

  • Experience: min values appeared as negative, need more analysis on that as Experience should be not be below 0. Values are spreaded and multiple peaks are present.
  • Age: Values are spread across ages, multiple peaks appears in the distribution.
  • Income: values are skewed. positive skewness. Lot of outliers towrds higher income.
  • Family: Distribution is almost similar but family sizes with 1 or 2 are comparatively more in distribution than having 3 or 4.
  • CCAvg: distribution is skewed (positive).Similar like Income.
  • Education: Presence of Undergrad is more in the distribution
  • Mortgage: Skewed data(positive). Majority have mortgage of less value
  • Personal Loan/Securities Account/CD Account: Less persons have opted for these options
  • Online: more than 59% cust have opted of online
  • CreditCard: close to 30% customer has creditcard
In [179]:
df[df['Experience']<0].describe()
Out[179]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
count 52.000000 52.000000 52.000000 52.000000 52.000000 52.000000 52.000000 52.000000 52.000000 52.0 52.000000 52.0 52.000000 52.000000
mean 2427.346154 24.519231 -1.442308 69.942308 93240.961538 2.865385 2.129423 2.076923 43.596154 0.0 0.115385 0.0 0.576923 0.288462
std 1478.834118 1.475159 0.639039 37.955295 1611.654806 0.970725 1.750562 0.836570 90.027068 0.0 0.322603 0.0 0.498867 0.457467
min 90.000000 23.000000 -3.000000 12.000000 90065.000000 1.000000 0.200000 1.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000
25% 767.250000 24.000000 -2.000000 40.750000 92167.750000 2.000000 1.000000 1.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000
50% 2783.500000 24.000000 -1.000000 65.500000 93060.000000 3.000000 1.800000 2.000000 0.000000 0.0 0.000000 0.0 1.000000 0.000000
75% 3669.500000 25.000000 -1.000000 86.750000 94720.000000 4.000000 2.325000 3.000000 0.000000 0.0 0.000000 0.0 1.000000 1.000000
max 4958.000000 29.000000 -1.000000 150.000000 95842.000000 4.000000 7.200000 3.000000 314.000000 0.0 1.000000 0.0 1.000000 1.000000
In [180]:
df[df['Experience']<0].Experience.value_counts()
Out[180]:
-1    33
-2    15
-3     4
Name: Experience, dtype: int64
  • All the negative values are for younger populatin age from 23-29
  • All are majorly Undergrads in Education
  • Total 52 records are impacted
  • Total 3 negative values -1,-2,-3
  • Need to clean the column
  • from Profile report we can see Experience has high correlation with Age, need in depth analysis for correlation and values
In [181]:
plt.figure(figsize=(16,8))
sns.heatmap(df.corr(), annot=True)
print('Experience and Age shows very high correlation, will try to replace negative value from same age grop in the fields')
print('Experience has relation with Education as well but we dont see much on heatmap.')
Experience and Age shows very high correlation, will try to replace negative value from same age grop in the fields
Experience has relation with Education as well but we dont see much on heatmap.
In [182]:
df[df['Experience']== -1].Age.value_counts()
Out[182]:
25    17
24     6
23     6
29     3
26     1
Name: Age, dtype: int64
In [183]:
df[df['Experience']== -1].Education.value_counts()
Out[183]:
3    12
1    11
2    10
Name: Education, dtype: int64
In [184]:
df[df['Experience']== -2].Age.value_counts()
Out[184]:
24    9
23    4
28    1
25    1
Name: Age, dtype: int64
In [185]:
df[df['Experience']== -3].Age.value_counts()
Out[185]:
23    2
24    2
Name: Age, dtype: int64
In [186]:
df[df['Age']== 23].Experience.value_counts()
Out[186]:
-1    6
-2    4
-3    2
Name: Experience, dtype: int64
In [187]:
df[df['Age']== 23].Education.value_counts()
Out[187]:
1    7
2    5
Name: Education, dtype: int64
In [188]:
df[df['Age']== 24].Experience.value_counts()
Out[188]:
 0    11
-2     9
-1     6
-3     2
Name: Experience, dtype: int64
In [189]:
df[df['Age']== 24].Education.value_counts()
Out[189]:
1    13
2     8
3     7
Name: Education, dtype: int64
  • For age 23 & 24 it seems all values are negative, we can replace Experience to 0 for age 23 & 24 considering the only nearest positive value they got up to is 0.
  • Considering the education wont affect the result
In [190]:
df.loc[(df['Age']== 23),'Experience']=0
In [191]:
df.loc[(df['Age']== 24),'Experience']=0
In [192]:
df[df['Experience']== -2].Age.value_counts()
Out[192]:
25    1
28    1
Name: Age, dtype: int64
  • Experience -3 is gone and for -2 only age 25 and 28 are left. we will get mean/median of positive values to replace these ages
In [193]:
df[df['Experience']== -2]
Out[193]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
451 452 28 -2 48 94132 2 1.75 3 89 0 0 0 1 0
4481 4482 25 -2 35 95045 4 1.00 3 0 0 0 0 1 0
In [194]:
print(df[(df['Age']== 25) & (df['Education']== 3) & (df['Experience']> -1)].Experience.describe())
print('\nWe will replace the values with median of 0 for age 25, considering the age and eduction of 3')
count    9.000000
mean     0.111111
std      0.333333
min      0.000000
25%      0.000000
50%      0.000000
75%      0.000000
max      1.000000
Name: Experience, dtype: float64

We will replace the values with median of 0 for age 25, considering the age and eduction of 3
In [195]:
df.loc[((df['Age']== 25) & (df['Education']== 3) & (df['Experience']< 0)),'Experience']=0
In [196]:
print(df[(df['Age']== 28) & (df['Education']== 3) & (df['Experience']> -1)].Experience.describe())
print('\nWe will replace the values with median of 0 for age 28, considering the age and eduction of 3')
count    21.000000
mean      2.619048
std       0.669043
min       2.000000
25%       2.000000
50%       3.000000
75%       3.000000
max       4.000000
Name: Experience, dtype: float64

We will replace the values with median of 0 for age 28, considering the age and eduction of 3
In [197]:
df.loc[((df['Age']== 28) & (df['Education']== 3) & (df['Experience']== -2)),'Experience']=3
In [198]:
df[df['Experience']== -2].Age.count()
Out[198]:
0
  • negative values are gone for -3 and -2, and only -1 is left
In [199]:
df[df['Experience']== -1].Age.value_counts()
Out[199]:
25    8
29    3
26    1
Name: Age, dtype: int64
In [200]:
df[df['Experience']== -1]
Out[200]:
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
536 537 25 -1 43 92173 3 2.40 2 176 0 0 0 1 0
1428 1429 25 -1 21 94583 4 0.40 1 90 0 0 0 1 0
1905 1906 25 -1 112 92507 2 2.00 1 241 0 0 0 1 0
2545 2546 25 -1 39 94720 3 2.40 2 0 0 0 0 1 0
2980 2981 25 -1 53 94305 3 2.40 2 0 0 0 0 0 0
3076 3077 29 -1 62 92672 2 1.75 3 0 0 0 0 0 1
3279 3280 26 -1 44 94901 1 2.00 2 0 0 0 0 0 0
3292 3293 25 -1 13 95616 4 0.40 1 0 0 1 0 0 0
3946 3947 25 -1 40 93117 3 2.40 2 0 0 0 0 1 0
4015 4016 25 -1 139 93106 2 2.00 1 0 0 0 0 0 1
4088 4089 29 -1 71 94801 2 1.75 3 0 0 0 0 0 0
4957 4958 29 -1 50 95842 2 1.75 3 0 0 0 0 0 1
In [201]:
print(df[(df['Age']== 25) & (df['Education']==1) & (df['Experience']> -1)].Experience.describe())
print(df[(df['Age']== 25) & (df['Education']==2) & (df['Experience']> -1)].Experience.describe())
print('We will replace the values with 50% of 1 for age 25 and edu - 2,1')
print(df[(df['Age']== 26) & (df['Education']== 2) & (df['Experience']> -1)].Experience.describe())
print('We will replace the values with median of 1 for age 26 and edu - 2')
print(df[(df['Age']== 29) & (df['Education']== 3) & (df['Experience']> -1)].Experience.describe())
print('We will replace the values with median of 3 for age 29 and edu 3')
count    19.000000
mean      0.842105
std       0.374634
min       0.000000
25%       1.000000
50%       1.000000
75%       1.000000
max       1.000000
Name: Experience, dtype: float64
count    7.000000
mean     0.142857
std      0.377964
min      0.000000
25%      0.000000
50%      0.000000
75%      0.000000
max      1.000000
Name: Experience, dtype: float64
We will replace the values with 50% of 1 for age 25 and edu - 2,1
count    23.000000
mean      0.826087
std       0.777652
min       0.000000
25%       0.000000
50%       1.000000
75%       1.000000
max       2.000000
Name: Experience, dtype: float64
We will replace the values with median of 1 for age 26 and edu - 2
count    32.00000
mean      3.50000
std       1.04727
min       0.00000
25%       3.00000
50%       3.00000
75%       4.00000
max       5.00000
Name: Experience, dtype: float64
We will replace the values with median of 3 for age 29 and edu 3
In [202]:
df.loc[((df['Age']== 29) & (df['Experience']< 0)),'Experience']=3
df.loc[((df['Age']== 26) & (df['Experience']< 0)),'Experience']=1
df.loc[((df['Age']== 25) & (df['Experience']< 0)),'Experience']=1
In [203]:
df[df['Experience']<0].Experience.count()
Out[203]:
0

No more negative values left for Experience

In [ ]:
 
In [ ]:
 
In [ ]:
 

Looking for Null values

In [204]:
df.isna().sum()
Out[204]:
ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64
In [205]:
df.isnull().sum()
Out[205]:
ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64
  • No Null/Na values appears in the columns

Analyzing Columns with Personal Loan

In [206]:
sns.pairplot(df,hue='Personal Loan',diag_kind='hist')
Out[206]:
<seaborn.axisgrid.PairGrid at 0x2c90c86f048>
In [207]:
df_select=df.drop(['ID','ZIP Code','Securities Account','CD Account','Online','CreditCard'],axis=1)
#dropping few cols to get a better picture
In [ ]:
 
In [208]:
sns.pairplot(df_select)
Out[208]:
<seaborn.axisgrid.PairGrid at 0x2c90b962ec8>
In [209]:
print("with first glance can clearly notice high correlation with age and experience, income and CCAvg, mortgage with Income")
with first glance can clearly notice high correlation with age and experience, income and CCAvg, mortgage with Income
In [210]:
df_select=df_select.astype({"Personal Loan":str})
In [211]:
sns.pairplot(df_select,hue="Personal Loan")
plt.show()
print('''For Age/Experience, Loan distribution looks like evenly distributed,
For Income, shows high presence on loan acceptance for high income group
For Family, have comparativly higer presne of loan acceptance in family members with 3 or 4
For CCAvg,  higer CCAvg group members have higher chances of acceptance of Loan
For Education, Loan acceptance grew with education level
For Mortgage, doenst provide much data from this view''')
For Age/Experience, Loan distribution looks like evenly distributed,
For Income, shows high presence on loan acceptance for high income group
For Family, have comparativly higer presne of loan acceptance in family members with 3 or 4
For CCAvg,  higer CCAvg group members have higher chances of acceptance of Loan
For Education, Loan acceptance grew with education level
For Mortgage, doenst provide much data from this view
In [212]:
sns.distplot(df['Age'])
Out[212]:
<AxesSubplot:xlabel='Age'>
In [213]:
print(df['Personal Loan'].value_counts())
print('Persons taking loan are: 9.6% ')
0    4520
1     480
Name: Personal Loan, dtype: int64
Persons taking loan are: 9.6% 
In [214]:
#ID has no relation with anything hence skipping it.
In [215]:
#Age
In [216]:
#What age group is taking loan
print(df[df['Personal Loan']==1].Age.describe())
print('''\nAge looks distributed across the people taking personal loan.
      Majority of cutomers are 35-65 considering 25%.
      ''')
count    480.000000
mean      45.066667
std       11.590964
min       26.000000
25%       35.000000
50%       45.000000
75%       55.000000
max       65.000000
Name: Age, dtype: float64

Age looks distributed across the people taking personal loan.
      Majority of cutomers are 35-65 considering 25%.
      
In [217]:
#distribution of the age group
sns.distplot(df[df['Personal Loan']==1].Age,bins=10,color='r',rug=True)
sns.distplot(df[df['Personal Loan']==0].Age,bins=10,color='g',rug=True)
plt.show()
In [218]:
#distribution of the age group
#have to make kde false else for the kde representaion its showing both graphs on same scaled level
sns.distplot(df[df['Personal Loan']==0]['Age'],bins=10,color='g',kde=False,rug=True);
sns.distplot(df[df['Personal Loan']==1]['Age'],bins=10,color='r',kde=False,rug=True);

plt.show();
print(''' Personal Loan customers looks distributed across the age group''')
 Personal Loan customers looks distributed across the age group
In [219]:
#Experience
In [220]:
print(df[df['Personal Loan']==1].Experience.describe())
print('''\nAge and Experience are highly correlated and expecting same behaviour with Loan as well.
      ''')
count    480.000000
mean      19.843750
std       11.582443
min        0.000000
25%        9.000000
50%       20.000000
75%       30.000000
max       41.000000
Name: Experience, dtype: float64

Age and Experience are highly correlated and expecting same behaviour with Loan as well.
      
In [221]:
sns.distplot(df[df['Personal Loan']==1].Experience,bins=5)
Out[221]:
<AxesSubplot:xlabel='Experience'>
In [222]:
sns.distplot(df[df['Personal Loan']==0]['Experience'],bins=10,label='Loan 0',kde=False,rug=True);
sns.distplot(df[df['Personal Loan']==1]['Experience'],bins=10,label='Loan 1',kde=False,rug=True);

plt.legend();
plt.show();

print(''' Personal Loan customers looks distributed across the experience range as in age.''')
 Personal Loan customers looks distributed across the experience range as in age.
In [223]:
#Income
In [224]:
print(df[df['Personal Loan']==1].Income.describe())
print(df[df['Personal Loan']==0].Income.describe())
print('''\n Income for cust taken personal loan are ranging from 60-203. For others its ranging from 8-224.
This details suggests the personal loan is usually not accepted by income group below 60k 
and considering the 25% value of 122K, there is higher chance of accepatnce of 
Personal loan for income group above 122K. 

      ''')
count    480.000000
mean     144.745833
std       31.584429
min       60.000000
25%      122.000000
50%      142.500000
75%      172.000000
max      203.000000
Name: Income, dtype: float64
count    4520.000000
mean       66.237389
std        40.578534
min         8.000000
25%        35.000000
50%        59.000000
75%        84.000000
max       224.000000
Name: Income, dtype: float64

 Income for cust taken personal loan are ranging from 60-203. For others its ranging from 8-224.
This details suggests the personal loan is usually not accepted by income group below 60k 
and considering the 25% value of 122K, there is higher chance of accepatnce of 
Personal loan for income group above 122K. 

      
In [225]:
sns.distplot(df[df['Personal Loan']==0]['Income'],kde=False,label='Loan 0',rug=True);
sns.distplot(df[df['Personal Loan']==1]['Income'],kde=False,label='Loan 1',rug=True);

plt.legend();
plt.show();
print('''from this image we can assume the personal loan has more cahnce of acceptance towards higher income group.
From above details as well we can notice the distribution of customers taking loan is higher above 122K income.
The extream right and left of graph suggests these grousp are not so inclienced towards the loan.
There is a very high rate of conversion for income group at range from 150-200.
''')
from this image we can assume the personal loan has more cahnce of acceptance towards higher income group.
From above details as well we can notice the distribution of customers taking loan is higher above 122K income.
The extream right and left of graph suggests these grousp are not so inclienced towards the loan.
There is a very high rate of conversion for income group at range from 150-200.

In [226]:
#Family
In [227]:
print(df[df['Personal Loan']==1].Family.describe())
print(df[df['Personal Loan']==0].Family.describe())
count    480.000000
mean       2.612500
std        1.115393
min        1.000000
25%        2.000000
50%        3.000000
75%        4.000000
max        4.000000
Name: Family, dtype: float64
count    4520.000000
mean        2.373451
std         1.148771
min         1.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         4.000000
Name: Family, dtype: float64
In [228]:
sns.countplot(data=df,x='Family',hue='Personal Loan')
plt.show();
print('''
Personal Loan customer looks distributed in all range of family but family memeber with 3 and 4 have higher presence compared to others.
''')
Personal Loan customer looks distributed in all range of family but family memeber with 3 and 4 have higher presence compared to others.

In [229]:
#CCAvg
In [230]:
print(df[df['Personal Loan']==1].CCAvg.describe())
print(df[df['Personal Loan']==0].CCAvg.describe())

print('''
Personal Loan cust are distributed but cust having higher CCAvg (>8.8) have higer acceptance towards personal Loan.
Cust not taking Loan have a max of 8.8 compared to 10 in cust taking Loan.
Cust having CCAvg ranging from 2.6-5.35 makes 50% of customers taking loan.
''')
count    480.000000
mean       3.905354
std        2.097681
min        0.000000
25%        2.600000
50%        3.800000
75%        5.347500
max       10.000000
Name: CCAvg, dtype: float64
count    4520.000000
mean        1.729009
std         1.567647
min         0.000000
25%         0.600000
50%         1.400000
75%         2.300000
max         8.800000
Name: CCAvg, dtype: float64

Personal Loan cust are distributed but cust having higher CCAvg (>8.8) have higer acceptance towards personal Loan.
Cust not taking Loan have a max of 8.8 compared to 10 in cust taking Loan.
Cust having CCAvg ranging from 2.6-5.35 makes 50% of customers taking loan.

In [231]:
sns.distplot(df[df['Personal Loan']==0]['CCAvg'],kde=False,label='Loan 0',rug=True);
sns.distplot(df[df['Personal Loan']==1]['CCAvg'],kde=False,label='Loan 1',rug=True);

plt.legend();
plt.show();
In [232]:
#Education
In [233]:
print(df[df['Personal Loan']==1].Education.describe())
print(df[df['Personal Loan']==0].Education.describe())
count    480.000000
mean       2.233333
std        0.753373
min        1.000000
25%        2.000000
50%        2.000000
75%        3.000000
max        3.000000
Name: Education, dtype: float64
count    4520.000000
mean        1.843584
std         0.839975
min         1.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         3.000000
Name: Education, dtype: float64
In [234]:
sns.boxplot(data=df,x='Education',y='Income')
print('Mean average seems slightly higher for Undergrads')
Mean average seems slightly higher for Undergrads
In [235]:
sns.boxplot(data=df,x='Education',y='Income',hue='Personal Loan')
print('Personal Loan is more choosen by education group 2 and 3')
Personal Loan is more choosen by education group 2 and 3
In [236]:
sns.countplot(data=df,x='Education',hue='Personal Loan')
plt.show();
print('''
Personal Loan customer presence is more in education group 3 and 2 compared to 1.
There is higher chance of conversion rate if the cust is graduate/Advanced/Professional education level
''')
Personal Loan customer presence is more in education group 3 and 2 compared to 1.
There is higher chance of conversion rate if the cust is graduate/Advanced/Professional education level

In [237]:
#Mortgage
In [238]:
print(df[df['Personal Loan']==1].Mortgage.describe())
print(df[df['Personal Loan']==0].Mortgage.describe())
count    480.000000
mean     100.845833
std      160.847862
min        0.000000
25%        0.000000
50%        0.000000
75%      192.500000
max      617.000000
Name: Mortgage, dtype: float64
count    4520.000000
mean       51.789381
std        92.038931
min         0.000000
25%         0.000000
50%         0.000000
75%        98.000000
max       635.000000
Name: Mortgage, dtype: float64
In [239]:
plt.figure(figsize=(16,8))
sns.distplot(df[df['Personal Loan']==1]['Mortgage'],kde=False,label='Loan 1');
sns.distplot(df[df['Personal Loan']==0]['Mortgage'],kde=False,label='Loan 0');

plt.legend();
plt.show();
In [240]:
#Eliminating the mortgage 0 as its making the view very difficult to read
plt.figure(figsize=(16,8))
sns.distplot(df[(df['Personal Loan']==0) & (df['Mortgage']>0) ]['Mortgage'],kde=False,label='Loan 0',rug=True);
sns.distplot(df[(df['Personal Loan']==1) & (df['Mortgage']>0) ]['Mortgage'],kde=False,label='Loan 1',rug=True);


plt.legend();
plt.show();
print('''
Higher chance of conversion to Loan for cust having Mortgage above 280/300K
(from 50% in below table)
''')
Higher chance of conversion to Loan for cust having Mortgage above 280/300K
(from 50% in below table)

In [241]:
print(df[(df['Personal Loan']==1) & (df['Mortgage']>0)].Mortgage.describe())
print(df[(df['Personal Loan']==0) & (df['Mortgage']>0)].Mortgage.describe())
print('''
Customers having no Mortage have very likelihood of taking PErsonal Loan.
For other groups of having varying range of mortages have mostly steady distribution.
Comparatively customers having mortgages having more than approx. 280K have higher chances
of accepting personal loan (may be because of liqiudity crunch due to higher mortage)
''')
count    168.000000
mean     288.130952
std      141.145466
min       75.000000
25%      174.000000
50%      282.000000
75%      373.250000
max      617.000000
Name: Mortgage, dtype: float64
count    1370.000000
mean      170.867153
std        87.186843
min        75.000000
25%       106.000000
50%       146.000000
75%       212.000000
max       635.000000
Name: Mortgage, dtype: float64

Customers having no Mortage have very likelihood of taking PErsonal Loan.
For other groups of having varying range of mortages have mostly steady distribution.
Comparatively customers having mortgages having more than approx. 280K have higher chances
of accepting personal loan (may be because of liqiudity crunch due to higher mortage)

In [242]:
#Securities Account
In [243]:
print(df[df['Personal Loan']==1]['Securities Account'].describe())
print(df[df['Personal Loan']==0]['Securities Account'].describe())
print("No relation can be inferred")
count    480.000000
mean       0.125000
std        0.331064
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: Securities Account, dtype: float64
count    4520.000000
mean        0.102212
std         0.302961
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: Securities Account, dtype: float64
No relation can be inferred
In [244]:
sns.boxplot(data=df,x='Securities Account',y='Income',hue='Personal Loan')
print("Distribution looks even in both groups")
Distribution looks even in both groups
In [ ]:
 
In [245]:
df['Securities Account'].value_counts()
Out[245]:
0    4478
1     522
Name: Securities Account, dtype: int64
In [246]:
df[df['Personal Loan']==1]['Securities Account'].value_counts()
Out[246]:
0    420
1     60
Name: Securities Account, dtype: int64
In [247]:
df[df['Personal Loan']==0]['Securities Account'].value_counts()
Out[247]:
0    4058
1     462
Name: Securities Account, dtype: int64
In [248]:
print('Percentage of conversin for people having sec account - '+str((60/522)*100))
print('Percentage of conversin for people not having sec account - '+str((420/4478)*100))
Percentage of conversin for people having sec account - 11.494252873563218
Percentage of conversin for people not having sec account - 9.379187137114783
In [249]:
sns.countplot(data=df,x='Securities Account',hue='Personal Loan');
In [250]:
ax=sns.countplot(data=df,x='Securities Account',hue='Personal Loan')
total = float(len(df))
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format(height/total),
            ha="center")
#over all percentage on top of bars #code referenced from https://stackoverflow.com/questions/31749448/how-to-add-percentages-on-top-of-bars-in-seaborn
plt.show();
print('''
Customers not having securities loan have higher conversion but the number of 
cust not having Sec Account is also higher.
Not able to gather much relation from the distribution on securities account''')
Customers not having securities loan have higher conversion but the number of 
cust not having Sec Account is also higher.
Not able to gather much relation from the distribution on securities account
In [251]:
#code referenced from https://stackoverflow.com/questions/31749448/how-to-add-percentages-on-top-of-bars-in-seaborn
def with_hue(plot, feature, Number_of_categories, hue_categories):
    a = [p.get_height() for p in plot.patches]
    patch = [p for p in plot.patches]
    for i in range(Number_of_categories):
        total = feature.value_counts().values[i]
        for j in range(hue_categories):
            percentage = '{:.1f}%'.format(100 * a[(j*Number_of_categories + i)]/total)
            x = patch[(j*Number_of_categories + i)].get_x() + patch[(j*Number_of_categories + i)].get_width() / 2 - 0.15
            y = patch[(j*Number_of_categories + i)].get_y() + patch[(j*Number_of_categories + i)].get_height() 
            ax.annotate(percentage, (x, y), size = 12)
    plt.show()
    
In [252]:
ax=sns.countplot(data=df,x='Securities Account',hue='Personal Loan')
with_hue(ax,df['Securities Account'],2,2)
In [253]:
pd.crosstab(df['Securities Account'],df['Personal Loan'])
Out[253]:
Personal Loan 0 1
Securities Account
0 4058 420
1 462 60
In [254]:
print('Statiscally people having securietes account have more chances of conversion.')
Statiscally people having securietes account have more chances of conversion.
In [255]:
#CD Account
In [256]:
pd.crosstab(df['CD Account'],df['Personal Loan'])
Out[256]:
Personal Loan 0 1
CD Account
0 4358 340
1 162 140
In [257]:
ax=sns.countplot(data=df,x='CD Account',hue='Personal Loan')
with_hue(ax,df['CD Account'],2,2)
In [258]:
print('Percentage of conversin for people having CD account - '+str((140/(140+162))*100))
print('Percentage of conversin for people not having CD account - '+str((340/(4358+340))*100))
Percentage of conversin for people having CD account - 46.35761589403973
Percentage of conversin for people not having CD account - 7.237122179650915
In [259]:
print('Visually & Statiscally people having CD account have higher chances of conversion for Loan.')
Visually & Statiscally people having CD account have higher chances of conversion for Loan.
In [260]:
# Online
In [261]:
pd.crosstab(df['Online'],df['Personal Loan'])
Out[261]:
Personal Loan 0 1
Online
0 1827 189
1 2693 291
In [262]:
ax=sns.countplot(data=df,x='Online',hue='Personal Loan')
#with_hue(ax,df['Online'],2,2)
In [263]:
print('Percentage of conversin for people having online account - '+str((291/(2693+291))*100))
print('Percentage of conversin for people not having online account - '+str((189/(1827+189))*100))
Percentage of conversin for people having online account - 9.75201072386059
Percentage of conversin for people not having online account - 9.375
In [264]:
print("Online doesnt show a impact on the loan acceptance")
Online doesnt show a impact on the loan acceptance
In [265]:
#Credit Card
In [266]:
pd.crosstab(df['CreditCard'],df['Personal Loan'])
Out[266]:
Personal Loan 0 1
CreditCard
0 3193 337
1 1327 143
In [267]:
ax=sns.countplot(data=df,x='CreditCard',hue='Personal Loan')
with_hue(ax,df['CreditCard'],2,2)
In [268]:
print('Percentage of conversin for people having CC - '+str((143/(143+1327))*100))
print('Percentage of conversin for people not having CC - '+str((337/(337+3193))*100))
Percentage of conversin for people having CC - 9.727891156462585
Percentage of conversin for people not having CC - 9.546742209631729
In [269]:
print("No impact of having CC")
No impact of having CC
In [ ]:
 
In [270]:
#Getting Target column
In [271]:
print(pd.DataFrame(df['Personal Loan']).info())
print()
print(df['Personal Loan'].value_counts())
print('''\n
Total 5000 records out of which only 480 customers have accepted the Loan.
''')
print("\nOut of total " + str(df['Personal Loan'].count()) + ", percentage customers opted for loan are: "+ str(df['Personal Loan'].value_counts(1)[1]*100))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Personal Loan  5000 non-null   int64
dtypes: int64(1)
memory usage: 39.2 KB
None

0    4520
1     480
Name: Personal Loan, dtype: int64


Total 5000 records out of which only 480 customers have accepted the Loan.


Out of total 5000, percentage customers opted for loan are: 9.6
In [272]:
df['Personal Loan'].value_counts(1)
Out[272]:
0    0.904
1    0.096
Name: Personal Loan, dtype: float64
In [273]:
print(df['Personal Loan'].value_counts())
print("\nOut of total " + str(df['Personal Loan'].count()) + ", percentage customers opted for loan are: "+ str(df['Personal Loan'].value_counts(1)[1]*100))
0    4520
1     480
Name: Personal Loan, dtype: int64

Out of total 5000, percentage customers opted for loan are: 9.6
In [ ]:
 
In [274]:
sns.countplot(data=df,x='Personal Loan')
    
plt.show()
In [275]:
#graphs credit - Referenced from https://www.datacamp.com/community/tutorials/categorical-data
labels = df['Personal Loan'].astype('category').cat.categories.tolist()
counts = df['Personal Loan'].value_counts()
sizes = [counts[var_cat] for var_cat in labels]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True) #autopct is show the % on plot
ax1.axis('equal')
plt.show()
print('''There is a huge gap between customers accepting the Loan and total customers,
class distribution is severely skewed.
This may results in models that have poor predictive performance, 
specifically for the minority class.''')
#https://www.datacamp.com/community/tutorials/diving-deep-imbalanced-data?utm_source=adwords_ppc&utm_campaignid=1455363063&utm_adgroupid=65083631748&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034358&utm_targetid=aud-392016246653:dsa-429603003980&utm_loc_interest_ms=&utm_loc_physical_ms=9061996&gclid=Cj0KCQjwvvj5BRDkARIsAGD9vlKQoq5lRfZhmcZAwNSvWsmJM1EIepUab4d5F2WH24kIiOE2Gt7oA3QaApJ2EALw_wcB
There is a huge gap between customers accepting the Loan and total customers,
class distribution is severely skewed.
This may results in models that have poor predictive performance, 
specifically for the minority class.
In [276]:
plt.figure(figsize=(16,8))
sns.heatmap(df.corr(), annot=True)
Out[276]:
<AxesSubplot:>
In [277]:
df.corr()['Personal Loan'].sort_values(ascending=False)
Out[277]:
Personal Loan         1.000000
Income                0.502462
CCAvg                 0.366889
CD Account            0.316355
Mortgage              0.142095
Education             0.136722
Family                0.061367
Securities Account    0.021954
Online                0.006278
CreditCard            0.002802
ZIP Code              0.000107
Age                  -0.007726
Experience           -0.007983
ID                   -0.024801
Name: Personal Loan, dtype: float64
In [278]:
print('''
As analyzed above, Income, CCAvg has highest linear correlation with Loan.
Age and Experience are highly correlated but Experience is more negatively correlated with Loan then Age.
ZIP Code has least correlation with target along with CreditCard and Online
''')
As analyzed above, Income, CCAvg has highest linear correlation with Loan.
Age and Experience are highly correlated but Experience is more negatively correlated with Loan then Age.
ZIP Code has least correlation with target along with CreditCard and Online

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [279]:
#Split the data into training and test set in the ratio of 70:30 respectively
In [280]:
#Testing with majority data with no changes, except ID and Age (Age and Experience are highly correlated and dropping one will reduce the complexity)
In [281]:
X=df.drop(['ID','Age','Personal Loan'],axis=1)
y=df['Personal Loan']
In [282]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=5)
In [283]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3500 entries, 2015 to 2915
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Experience          3500 non-null   int64  
 1   Income              3500 non-null   int64  
 2   ZIP Code            3500 non-null   int64  
 3   Family              3500 non-null   int64  
 4   CCAvg               3500 non-null   float64
 5   Education           3500 non-null   int64  
 6   Mortgage            3500 non-null   int64  
 7   Securities Account  3500 non-null   int64  
 8   CD Account          3500 non-null   int64  
 9   Online              3500 non-null   int64  
 10  CreditCard          3500 non-null   int64  
dtypes: float64(1), int64(10)
memory usage: 328.1 KB
In [284]:
print("Train and Test were splitted into 70:30")
Train and Test were splitted into 70:30
In [285]:
y_train.value_counts(1)
Out[285]:
0    0.905429
1    0.094571
Name: Personal Loan, dtype: float64
In [286]:
y_test.value_counts(1)
Out[286]:
0    0.900667
1    0.099333
Name: Personal Loan, dtype: float64
In [287]:
print("Train and Test data sets have similar distribution of target variable")
Train and Test data sets have similar distribution of target variable
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [288]:
#Use different classification models (Logistic, K-NN and Naïve Bayes) to predict the likelihood of a customer buying personal loans
#Print the confusion matrix for all the above models
In [289]:
#logistic regression
In [290]:
from sklearn.linear_model import LogisticRegression
logRegModel=LogisticRegression()
logRegModel.fit(X_train,y_train)
y_predict=logRegModel.predict(X_test)

from sklearn.metrics import accuracy_score,confusion_matrix,recall_score,f1_score,precision_score,roc_curve,log_loss,auc
print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))

modelComp=pd.DataFrame({'Model':['Logistic Regression - 0.5'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]})
Accuracy score: 0.912
confuion matrix:
 [[1319   32]
 [ 100   49]]
Recall Score:  0.3288590604026846
Precission Score:  0.6049382716049383
F1 Score:  0.42608695652173917
In [291]:
#Logistic Regression with default thresold - Recall - 33% and Accuracy - 91.2%
In [292]:
#referenced from training materials
from sklearn import metrics
def draw_cm( actual, predicted ):
    cm = metrics.confusion_matrix( actual, predicted, [0,1] )
    sns.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["Loan 0", "Loan 1"] , yticklabels = ["Loan 0", "Loan 1"] )
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
In [293]:
draw_cm(y_test, y_predict)
In [294]:
from sklearn.preprocessing import binarize
#changing the threshold to 0.3
y_pred_class = binarize([logRegModel.predict_proba(X_test)[:, 1]], 0.3)[0]

print('Accuracy score:',accuracy_score(y_test,y_pred_class))
print('confuion matrix:\n',confusion_matrix(y_test,y_pred_class))
print('Recall Score: ',recall_score(y_test, y_pred_class))
print('Precission Score: ',precision_score(y_test, y_pred_class))
print('F1 Score: ',f1_score(y_test, y_pred_class))
draw_cm(y_test, y_pred_class)
modelComp=modelComp.append(pd.DataFrame({'Model':['Logistic Regression - 0.3'],'Accuracy':[accuracy_score(y_test,y_pred_class)*100],'Precission':[precision_score(y_test, y_pred_class)*100],'Recall':[recall_score(y_test, y_pred_class)*100]}))
Accuracy score: 0.8986666666666666
confuion matrix:
 [[1268   83]
 [  69   80]]
Recall Score:  0.5369127516778524
Precission Score:  0.49079754601226994
F1 Score:  0.5128205128205129
In [ ]:
 
In [295]:
#Logistic Regression with threshold decrease to .3 Recall increased to 53.7% and Accuracy - 89.9%
In [296]:
#referenced from training materials
y_pred_proba = logRegModel.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

Logistic Regression:

  • We observe the over all accuracy is hovering around 90% for both threshold levels but recall rate of 33% is very poor.
  • For predicting customer who could be converted towards taking personal Loan we have to focus more on Recall then Preceission/Accuracy as we dont want to miss any customers who could take a Loan with some compromixe on overall accuracy/precission.
  • Post dropping the threshold from default to 0.3 we can observe the recall rate increased from 33% to 54%. Still this means we are almost missing half of potential customers. We have almost missed 69 out of 149 total potential customers.
  • Accuracy also got compramised but its just dropped from 91% to 89% which is acceptable compared to % increase in Recall.
  • Precission is hovering around 50%-60% which is also not a good sign.
In [ ]:
 
In [297]:
#Stats Model Logit
In [298]:
#!pip install statsmodels
In [299]:
import statsmodels.api as sm
logit = sm.Logit( y_train, sm.add_constant( X_train ) )
lg = logit.fit()
#lg.summary2()
Optimization terminated successfully.
         Current function value: 0.127296
         Iterations 9
In [300]:
y_predict=pd.DataFrame(lg.predict( sm.add_constant( X_test ) ))
#print(y_predict[0:5])
#y_predict.info()
In [ ]:
 
In [301]:
#referenced from traing materials
def get_predictions( y_test, X_test,model ):
    y_pred_df = pd.DataFrame( { 'actual': y_test,
                               "predicted_prob": model.predict( sm.add_constant( X_test ) ) } )
    return y_pred_df
In [302]:
y_pred_df = get_predictions(y_test,X_test, lg )
y_pred_df.head()
Out[302]:
actual predicted_prob
27 0 0.020818
1482 0 0.000018
3021 0 0.127017
3867 0 0.016468
637 0 0.014825
In [303]:
y_pred_df['predicted'] = y_pred_df.predicted_prob.map( lambda x: 1 if x > 0.3 else 0)
y_pred_df.head()
Out[303]:
actual predicted_prob predicted
27 0 0.020818 0
1482 0 0.000018 0
3021 0 0.127017 0
3867 0 0.016468 0
637 0 0.014825 0
In [304]:
y_predict=y_pred_df['predicted']
In [305]:
#y_predict = y_predict.apply( lambda x: 1 if x > 0.6 else 0)
In [306]:
print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
Accuracy score: 0.9406666666666667
confuion matrix:
 [[1303   48]
 [  41  108]]
Recall Score:  0.7248322147651006
Precission Score:  0.6923076923076923
F1 Score:  0.7081967213114754
In [307]:
#Logit Recall - 72.5% and Accuracy - 94.07%
In [308]:
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Logit -StatsModel - 0.3'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
In [309]:
y_pred_proba = y_pred_df.predicted_prob
[fpr1, tpr1, thr1] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr1, tpr1, color='c', label='ROC curve Logit (area = %0.2f)' % auc(fpr1, tpr1))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()

Logit StatsModel:

  • We observe the over all accuracy is around 94% and Recall - 72.5%
  • Keeping the threshold at 0.3 (Considered from previous run)
  • Comparing the ROC curve also shows it covers more area compared to LogisticRegression with its default settings
  • Precisiion is hovering around 70%
In [ ]:
 
In [310]:
#KNN
In [311]:
from sklearn.neighbors import KNeighborsClassifier
KnnModel = KNeighborsClassifier(n_neighbors=3)
KnnModel.fit(X_train,y_train)
y_predict=KnnModel.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))

draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['KNN - 3 Neigbours'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.9053333333333333
confuion matrix:
 [[1309   42]
 [ 100   49]]
Recall Score:  0.3288590604026846
Precission Score:  0.5384615384615384
F1 Score:  0.4083333333333334
In [312]:
#KNN recall - 32.9% Accuracy - 90.5%
In [ ]:
 
In [313]:
y_pred_proba = KnnModel.predict_proba(X_test)[:, 1]
[fpr2, tpr2, thr2] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr1, tpr1, color='c', label='ROC curve Logit (area = %0.2f)' % auc(fpr1, tpr1))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()

KNN

  • Considering KNN with 3 neighbours, Accuracy - 90.5% and recall - 33%
  • Precission - 54%
  • this model has very low recall rate, its only able to predict 49 out of 149 potential customers for Loan.
  • Srea under ROC is as well way low compared to Logistic Regression & Logit
  • This is slightly expected as KNN being based on distance and as our data is not scaled, we might be getting biases from variables having higher magnitudes.
In [ ]:
 
In [ ]:
 
In [314]:
#Naive bayes - GaussianNB
In [315]:
from sklearn.naive_bayes import GaussianNB,BernoulliNB
NBGauModel = GaussianNB()

NBGauModel.fit(X_train,y_train)
y_predict=NBGauModel.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Naive Bayes - Gaussian'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8813333333333333
confuion matrix:
 [[1239  112]
 [  66   83]]
Recall Score:  0.5570469798657718
Precission Score:  0.4256410256410256
F1 Score:  0.4825581395348837
In [316]:
#NB Gaussian - Recall - 55.7% Accuracy - 88.13%
In [317]:
y_pred_proba = NBGauModel.predict_proba(X_test)[:, 1]
[fpr3, tpr3, thr3] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr1, tpr1, color='c', label='ROC curve Logit (area = %0.2f)' % auc(fpr1, tpr1))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot(fpr3, tpr3, color='b', label='ROC curve NaiveBayes (area = %0.2f)' % auc(fpr3, tpr3))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()

Naive Bayes - GaussianNB

  • Accuracy - 88%, Recall - 56%,Precission - 43%
  • With all default setting as well Recall rate is better then Logistic and KNN
  • Area under ROC is also comparable to Logistc Regression
  • Able to predict 83 out of 149 potential customers
In [318]:
# BernoulliNB
In [319]:
NBBernoulliModel = BernoulliNB()

NBBernoulliModel.fit(X_train,y_train)
y_predict=NBBernoulliModel.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp=modelComp.append(pd.DataFrame({'Model':['Naive Bayes - Bernoulli'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
Accuracy score: 0.8873333333333333
confuion matrix:
 [[1314   37]
 [ 132   17]]
Recall Score:  0.11409395973154363
Precission Score:  0.3148148148148148
F1 Score:  0.16748768472906406
In [320]:
#NB Bernoulli  - Recall - 11.4% Accuracy - 88.7%
In [321]:
y_pred_proba = NBBernoulliModel.predict_proba(X_test)[:, 1]
[fpr4, tpr4, thr4] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr1, tpr1, color='c', label='ROC curve Logit (area = %0.2f)' % auc(fpr1, tpr1))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot(fpr3, tpr3, color='b', label='ROC curve NaiveBayes (area = %0.2f)' % auc(fpr3, tpr3))
plt.plot(fpr4, tpr4, color='r', label='ROC curve Bernoulli (area = %0.2f)' % auc(fpr4, tpr4))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()

Naive Bayes

  • Accuracy - 88.7% and Recall - 11.4%
  • This model is very poor in predicting the potential customers.
In [ ]:
 
In [322]:
#Give your reasoning on which is the best model in this case and why it performs better
In [323]:
y_pred_proba = NBBernoulliModel.predict_proba(X_test)[:, 1]
[fpr4, tpr4, thr4] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr1, tpr1, color='c', label='ROC curve Logit (area = %0.2f)' % auc(fpr1, tpr1))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot(fpr3, tpr3, color='b', label='ROC curve NaiveBayes (area = %0.2f)' % auc(fpr3, tpr3))
plt.plot(fpr4, tpr4, color='r', label='ROC curve Bernoulli (area = %0.2f)' % auc(fpr4, tpr4))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()
In [324]:
modelComp
Out[324]:
Model Accuracy Precission Recall
0 Logistic Regression - 0.5 91.200000 60.493827 32.885906
0 Logistic Regression - 0.3 89.866667 49.079755 53.691275
0 Logit -StatsModel - 0.3 94.066667 69.230769 72.483221
0 KNN - 3 Neigbours 90.533333 53.846154 32.885906
0 Naive Bayes - Gaussian 88.133333 42.564103 55.704698
0 Naive Bayes - Bernoulli 88.733333 31.481481 11.409396

Comparing Logistic Rgression, KNN and Naive Bayes

  • From above comparison table details we can clearly notice with default parameters NB-Gaussian is able to perform/predict better compared to KNN or Logistic Regression(default). As per AUC-ROC as well NB and Logistic regression are very close.
  • Never the less when we change the threshold value of Logistic regression we were able to get the Recall rate near to NB.
  • Our main goal is to identify potential customers who could be converted for chosing Loan and for that we need a model which can show better recall rate without compromsing Accuracy.
  • Considering those two factors only NB - Gaussian seems like a better choice among Logistic, KNN and NB.
  • Naive Bayes assumes that the features are conditionally independent. In our data set also majority of the attributes/columns are independent. We have removed Age which was highly correlated with Experience and rest other variables has very less depndency on each other.
In [325]:
# Additional theory details on NB and Logistic Regression referenced from - https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c#:~:text=Naive%20Bayes%20also%20assumes%20that,will%20be%20a%20better%20classifier.
In [326]:
#modelComp.drop(modelComp.index, inplace=True)
In [ ]:
 
In [327]:
# Next Approach to reduce few less significant columns from data set and captuer performance on model (reduce complexity)
In [328]:
#X=df.drop(['ID','Age','Personal Loan'],axis=1)
X=df.drop(['ID','Personal Loan','Age','CreditCard','Online','ZIP Code'],axis=1)
#X=df.drop(['ID','Personal Loan','Age','Online','ZIP Code'],axis=1)
y=df['Personal Loan']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=5)
In [329]:
#Logistic Regression
In [330]:
from sklearn.linear_model import LogisticRegression
logRegModel=LogisticRegression(max_iter=1000)
logRegModel.fit(X_train,y_train)
y_predict=logRegModel.predict(X_test)

from sklearn.metrics import accuracy_score,confusion_matrix,recall_score,f1_score,precision_score,roc_curve,log_loss,auc
print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))

modelComp2=pd.DataFrame({'Model':['Logistic Regression - 0.5'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]})

draw_cm(y_test, y_predict)
Accuracy score: 0.9453333333333334
confuion matrix:
 [[1329   22]
 [  60   89]]
Recall Score:  0.5973154362416108
Precission Score:  0.8018018018018018
F1 Score:  0.6846153846153847
In [331]:
#changing the threshold to 0.3
y_pred_class = binarize([logRegModel.predict_proba(X_test)[:, 1]], 0.3)[0]

print('Accuracy score:',accuracy_score(y_test,y_pred_class))
print('confuion matrix:\n',confusion_matrix(y_test,y_pred_class))
print('Recall Score: ',recall_score(y_test, y_pred_class))
print('Precission Score: ',precision_score(y_test, y_pred_class))
print('F1 Score: ',f1_score(y_test, y_pred_class))
draw_cm(y_test, y_pred_class)
modelComp2=modelComp2.append(pd.DataFrame({'Model':['Logistic Regression - 0.3'],'Accuracy':[accuracy_score(y_test,y_pred_class)*100],'Precission':[precision_score(y_test, y_pred_class)*100],'Recall':[recall_score(y_test, y_pred_class)*100]}))
Accuracy score: 0.9386666666666666
confuion matrix:
 [[1300   51]
 [  41  108]]
Recall Score:  0.7248322147651006
Precission Score:  0.6792452830188679
F1 Score:  0.7012987012987013
In [332]:
y_pred_proba = logRegModel.predict_proba(X_test)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [ ]:
 
In [333]:
#KNN
In [334]:
from sklearn.neighbors import KNeighborsClassifier
KnnModel = KNeighborsClassifier(n_neighbors=3)
KnnModel.fit(X_train,y_train)
y_predict=KnnModel.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))

draw_cm(y_test, y_predict)
modelComp2=modelComp2.append(pd.DataFrame({'Model':['KNN - 3 Neigbours'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
y_pred_proba = KnnModel.predict_proba(X_test)[:, 1]
[fpr2, tpr2, thr2] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()
Accuracy score: 0.9113333333333333
confuion matrix:
 [[1308   43]
 [  90   59]]
Recall Score:  0.3959731543624161
Precission Score:  0.5784313725490197
F1 Score:  0.47011952191235057
In [335]:
# NB - Gaussian
In [336]:
from sklearn.naive_bayes import GaussianNB,BernoulliNB
NBGauModel = GaussianNB()

NBGauModel.fit(X_train,y_train)
y_predict=NBGauModel.predict(X_test)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp2=modelComp2.append(pd.DataFrame({'Model':['Naive Bayes - Gaussian'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))

y_pred_proba = NBGauModel.predict_proba(X_test)[:, 1]
[fpr3, tpr3, thr3] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot(fpr3, tpr3, color='b', label='ROC curve NaiveBayes (area = %0.2f)' % auc(fpr3, tpr3))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()
Accuracy score: 0.8753333333333333
confuion matrix:
 [[1231  120]
 [  67   82]]
Recall Score:  0.5503355704697986
Precission Score:  0.40594059405940597
F1 Score:  0.4672364672364672
In [337]:
modelComp
Out[337]:
Model Accuracy Precission Recall
0 Logistic Regression - 0.5 91.200000 60.493827 32.885906
0 Logistic Regression - 0.3 89.866667 49.079755 53.691275
0 Logit -StatsModel - 0.3 94.066667 69.230769 72.483221
0 KNN - 3 Neigbours 90.533333 53.846154 32.885906
0 Naive Bayes - Gaussian 88.133333 42.564103 55.704698
0 Naive Bayes - Bernoulli 88.733333 31.481481 11.409396
In [338]:
modelComp2
Out[338]:
Model Accuracy Precission Recall
0 Logistic Regression - 0.5 94.533333 80.180180 59.731544
0 Logistic Regression - 0.3 93.866667 67.924528 72.483221
0 KNN - 3 Neigbours 91.133333 57.843137 39.597315
0 Naive Bayes - Gaussian 87.533333 40.594059 55.033557

Model Comparison - Logistic Regression, KNN, NB

  • There is a significant improvement in recall rate and accuracy of all the models once we dropped few less significant columns
  • with current model Logistic regression is able to predict 108 out of 149 potential customers. Logistic Regression will be better choice in this configuration
  • KNN's performance is also increased but to a very slight level
  • NB didnt showed any significant improvementat all on Recall rate
  • With current identified columns we can take Logistic regression with 0.3 threshold as best performing model with Recall rate of 73% and Accuracy - 94%
  • Logistic regression measures the relationship between a output variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. By reducing the complexity/variables in the eqations, model is able to predict better. BUt NB reamins at same level as Naive Bayes figures out how the data was generated given the results.
In [339]:
# Additional theory details on NB and Logistic Regression referenced from - https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c#:~:text=Naive%20Bayes%20also%20assumes%20that,will%20be%20a%20better%20classifier.
In [ ]:
 
In [ ]:
 
In [340]:
# Next - scaling the data - as values in the columns are on very diff scale then each other
In [341]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)
In [342]:
#Logistic Regression
In [343]:
from sklearn.linear_model import LogisticRegression
logRegModel=LogisticRegression()
logRegModel.fit(X_train_scaled,y_train)
y_predict=logRegModel.predict(X_test_scaled)

from sklearn.metrics import accuracy_score,confusion_matrix,recall_score,f1_score,precision_score,roc_curve,log_loss,auc
print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))

modelComp3=pd.DataFrame({'Model':['Logistic Regression - 0.5'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]})

draw_cm(y_test, y_predict)
Accuracy score: 0.9406666666666667
confuion matrix:
 [[1329   22]
 [  67   82]]
Recall Score:  0.5503355704697986
Precission Score:  0.7884615384615384
F1 Score:  0.6482213438735177
In [344]:
#changing the threshold to 0.3
y_pred_class = binarize([logRegModel.predict_proba(X_test_scaled)[:, 1]], 0.3)[0]

print('Accuracy score:',accuracy_score(y_test,y_pred_class))
print('confuion matrix:\n',confusion_matrix(y_test,y_pred_class))
print('Recall Score: ',recall_score(y_test, y_pred_class))
print('Precission Score: ',precision_score(y_test, y_pred_class))
print('F1 Score: ',f1_score(y_test, y_pred_class))
draw_cm(y_test, y_pred_class)
modelComp3=modelComp3.append(pd.DataFrame({'Model':['Logistic Regression - 0.3'],'Accuracy':[accuracy_score(y_test,y_pred_class)*100],'Precission':[precision_score(y_test, y_pred_class)*100],'Recall':[recall_score(y_test, y_pred_class)*100]}))
Accuracy score: 0.9353333333333333
confuion matrix:
 [[1294   57]
 [  40  109]]
Recall Score:  0.7315436241610739
Precission Score:  0.6566265060240963
F1 Score:  0.692063492063492
In [345]:
y_pred_proba = logRegModel.predict_proba(X_test_scaled)[:, 1]
[fpr, tpr, thr] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, color='coral', label='ROC curve (area = %0.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
In [ ]:
 
In [346]:
#KNN
In [347]:
from sklearn.neighbors import KNeighborsClassifier
KnnModel = KNeighborsClassifier(n_neighbors=3)
KnnModel.fit(X_train_scaled,y_train)
y_predict=KnnModel.predict(X_test_scaled)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))

draw_cm(y_test, y_predict)
modelComp3=modelComp3.append(pd.DataFrame({'Model':['KNN - 3 Neigbours'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))
y_pred_proba = KnnModel.predict_proba(X_test_scaled)[:, 1]
[fpr2, tpr2, thr2] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()
Accuracy score: 0.9646666666666667
confuion matrix:
 [[1345    6]
 [  47  102]]
Recall Score:  0.6845637583892618
Precission Score:  0.9444444444444444
F1 Score:  0.7937743190661479
In [348]:
# NB - Gaussian
In [349]:
from sklearn.naive_bayes import GaussianNB,BernoulliNB
NBGauModel = GaussianNB()

NBGauModel.fit(X_train_scaled,y_train)
y_predict=NBGauModel.predict(X_test_scaled)

print('Accuracy score:',accuracy_score(y_test,y_predict))
print('confuion matrix:\n',confusion_matrix(y_test,y_predict))
print('Recall Score: ',recall_score(y_test, y_predict))
print('Precission Score: ',precision_score(y_test, y_predict))
print('F1 Score: ',f1_score(y_test, y_predict))
draw_cm(y_test, y_predict)
modelComp3=modelComp3.append(pd.DataFrame({'Model':['Naive Bayes - Gaussian'],'Accuracy':[accuracy_score(y_test,y_predict)*100],'Precission':[precision_score(y_test, y_predict)*100],'Recall':[recall_score(y_test, y_predict)*100]}))

y_pred_proba = NBGauModel.predict_proba(X_test_scaled)[:, 1]
[fpr3, tpr3, thr3] = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.clf()
plt.plot(fpr, tpr, color='coral', label='ROC curve LogReg (area = %0.2f)' % auc(fpr, tpr))
plt.plot(fpr2, tpr2, color='g', label='ROC curve KNN (area = %0.2f)' % auc(fpr2, tpr2))
plt.plot(fpr3, tpr3, color='b', label='ROC curve NaiveBayes (area = %0.2f)' % auc(fpr3, tpr3))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiverrating characteristic example')
plt.legend(loc="lower right")
plt.show()
Accuracy score: 0.874
confuion matrix:
 [[1217  134]
 [  55   94]]
Recall Score:  0.6308724832214765
Precission Score:  0.41228070175438597
F1 Score:  0.49867374005305043
In [350]:
#Default
modelComp
Out[350]:
Model Accuracy Precission Recall
0 Logistic Regression - 0.5 91.200000 60.493827 32.885906
0 Logistic Regression - 0.3 89.866667 49.079755 53.691275
0 Logit -StatsModel - 0.3 94.066667 69.230769 72.483221
0 KNN - 3 Neigbours 90.533333 53.846154 32.885906
0 Naive Bayes - Gaussian 88.133333 42.564103 55.704698
0 Naive Bayes - Bernoulli 88.733333 31.481481 11.409396
In [351]:
#Removed few columns to reduce complexity
modelComp2
Out[351]:
Model Accuracy Precission Recall
0 Logistic Regression - 0.5 94.533333 80.180180 59.731544
0 Logistic Regression - 0.3 93.866667 67.924528 72.483221
0 KNN - 3 Neigbours 91.133333 57.843137 39.597315
0 Naive Bayes - Gaussian 87.533333 40.594059 55.033557
In [352]:
#Scaled data
modelComp3
Out[352]:
Model Accuracy Precission Recall
0 Logistic Regression - 0.5 94.066667 78.846154 55.033557
0 Logistic Regression - 0.3 93.533333 65.662651 73.154362
0 KNN - 3 Neigbours 96.466667 94.444444 68.456376
0 Naive Bayes - Gaussian 87.400000 41.228070 63.087248
In [ ]:
 

Model Comparison

  • Noticed Significant improvement in KNN and NB
  • No improvement noticed for Logistic Regression
  • Considering the scalled data with reduced columns we can notice the KNN has a significantly high rate for Accuracy, Precission. Recall rate is very close to Logistic Regression with 0.3 threshold.
  • KNN model is able to predict 102 out of 149 where as Logistic regression with 0.3 threshold predicted 109 out of 149 potential customers.
  • Considering the Accuracy where KNN only had 6 values misclassifed from Loan 0 to Loan 1 class, compared to 57 in Logistic regression.
  • Considering above parameters we can suggest KNN would work better if we have scalled data compared to other models.
  • KNN is a distance based algorithm, it chooses the k closest neighbors and then based on these neighbors, assigns a class or predicts a value for new observation. Hence it was affected by scale of the variables. If the data is not scaled it will give higher weightage to those variables which has high magnitude.
In [353]:
# KNN theory referenced from - https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7
In [ ]:
 
In [354]:
#End OF File